Journal of the American Medical Informatics Association

Oxford University Press (OUP)

Preprints posted in the last 30 days, ranked by how well they match Journal of the American Medical Informatics Association's content profile, based on 61 papers previously published here. The average preprint has a 0.12% match score for this journal, so anything above that is already an above-average fit.

1
Evaluating Large Language Models for Translating Multimodal Phenotype Documentations into Executable EHR Phenotyping Algorithms

Yan, C.; Xin, Y.; Su, W.-C.; Gangireddy, S.; Durbhakula, S.; Bruehl, S. P.; Dickson, A. L.; Li, L.; Feng, Q.; Malin, B. A.; Derr, T.; Wei, W.-Q.

2026-05-22 health informatics 10.64898/2026.05.20.26353690 medRxiv
Top 0.1%
41.1%

Research applications of electronic health record (EHR) phenotypes require translating clinical definitions into executable EHR database queries, a labor-intensive process. We evaluated two frontier large language models across five phenotypes and three documentation modalities. Both models captured high-level logic from structured text but degraded markedly with diagram-only input. Error analysis revealed seven failure categories. Documentation, rather than model capability, was the primary bottleneck, reinforcing the need for standardization and expert oversight.

2
Characterizing Documented Psychosocial Stressors in Pediatric Psychiatric Emergencies with an Open-Weight Large Language Model

Hartlage, C. S.; Manning, E. R.; Bernard, J.; Vaish, S.; Gray, J.; Young, M.; Pestian, T.; Folger, A. T.; Tachinardi, P.; Mendonca, E. A.; Brokamp, C.

2026-06-09 health informatics 10.64898/2026.06.08.26354931 medRxiv
Top 0.1%
40.2%

Objective: To evaluate whether a locally hosted open-weight large language model (LLM) can extract documented psychosocial factors from pediatric psychiatric intake notes and apply validated extraction to a large emergency psychiatry cohort. Materials and Methods: We identified emergency department presentations at Cincinnati Children's Hospital Medical Center from January 1, 2016, through December 31, 2024, among patients younger than 18 years with psychiatric billing diagnoses. Using full-text intake notes, gpt-oss:120b classified peer conflict, sleep disruption, and school-related academic, attendance, and disciplinary issues as detected, negated, or indeterminate. Four human raters independently reviewed 50 notes. We compared Fleiss' kappa among humans alone versus humans plus the LLM, assessed repeated-query stability across 50 independent calls per note, and applied the workflow to all eligible notes. Results: Among 37,315 eligible admissions, 22,284 had eligible intake notes; 22,270 produced parseable JSON. In detected-versus-not-detected coding, human-plus-LLM reliability did not differ significantly from human-only reliability across measures (human κ 0.71-0.94; human-plus-LLM κ 0.70-0.93). Stability was associated with human agreement: mean LLM-human agreement increased from 42.6% for classifications with less than 80% stability to 82.7% for classifications with 100% stability (Pearson r = 0.36). Full-cohort extraction showed frequent and overlapping documented factors: sleep disruption was most frequently detected (57.7%), followed by peer conflict (47.2%), academic issues (43.4%), disciplinary issues (43.3%), and attendance issues (16.9%). Discussion: Agreement varied by construct and was strongest when repeated model outputs were stable. Conclusion: Locally hosted open-weight LLMs can support scalable structured extraction of documented psychosocial factors from pediatric psychiatric intake notes after local validation.
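
The repeated-query stability measure mentioned in this abstract could be computed along the following lines. This is only a minimal sketch: it assumes stability means the share of repeated calls that return the most common label, which the abstract does not spell out. The function name `stability` and the sample labels are hypothetical.

```python
from collections import Counter

def stability(labels):
    """Return (modal_label, share of repeated calls that agree with it)."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    modal_label, modal_count = counts.most_common(1)[0]
    return modal_label, modal_count / len(labels)

# Hypothetical example: 50 repeated calls for one note and one factor.
calls = ["detected"] * 45 + ["indeterminate"] * 5
label, share = stability(calls)
print(label, share)
```

Under this assumed definition, a classification with 100% stability is one where all 50 calls return the same label.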

3
Real-world impact of a sepsis early detection model integrated into clinical workflow: a quasi-experimental study

Zhang, Y.; Trinh, S. H.; Phelan, T.; Byrd, T. F.; Tourani, R.; Kumar, V.; Caraballo, P. J.; Melton, G. B.; Simon, G. J.

2026-06-01 health informatics 10.64898/2026.05.22.26353890 medRxiv
Top 0.1%
40.0%

Background: Sepsis is a life-threatening condition in which delayed recognition and treatment are associated with increased mortality. While predictive models such as Epic's Early Detection of Sepsis Model (ESM) were developed to support early intervention, their real-world impact after integration into clinical workflows remains difficult to evaluate. Objectives: To evaluate the real-world impact of ESM integrated into clinical workflow on clinical outcomes, antibiotic use, and harm-benefit tradeoffs. Methods: We conducted a quasi-experimental study in a single healthcare system using encounter-level data from inpatient settings. Inpatient mortality, prolonged hospitalization, antibiotic use, and sepsis prevalence were compared between the pre-implementation period (3 June 2023 to 20 August 2024) and the online period (21 August 2024 to 26 December 2024) when the model became visible to clinicians. We also applied a counterfactual framework using models trained on pre-implementation data to estimate expected outcomes without ESM and to quantify harms related to overtreatment and delayed treatment. Results: Among 101,138 encounters, 86,884 occurred during the pre-implementation period and 14,254 during the online period. In unadjusted analyses, the online period had lower inpatient mortality, prolonged hospitalization, antibiotic use, and sepsis prevalence (all p≤0.002). In the counterfactual analyses, observed outcomes were lower than expected without ESM for mortality (1.21% vs 1.82%; p<0.001), prolonged hospitalization (5.56% vs 7.95%; p<0.001), and antibiotic use (43.52% vs 47.04%; p<0.001). False positive harm (37.72% vs 41.68%; p<0.001) was also lower than expected.
Conclusions: Integration of ESM into clinical workflow was associated with improved patient outcomes, reduced antibiotic use, and decreased harm from overtreatment, without evidence of increased harm from delayed treatment, supporting a positive net clinical benefit and the safety and effectiveness of ESM under Software as a Medical Device principles. Keywords: Machine learning, Electronic health records, Clinical workflow, Counterfactual analysis, Real-world evaluation

4
Augmenting Structured Diagnoses through Effective Use of Pre-trained Large Language Models on Clinical Notes

Razzaghi, H.; Nguyen, N.; Pargi, M.; Wieand, K.; Bunnell, T.; Bailey, C.

2026-06-02 health informatics 10.64898/2026.05.30.26354533 medRxiv
Top 0.1%
39.7%

Objective: Clinical narrative provides a unique window into provider reasoning and attribution, but use has been limited by resource requirements and extensive fine-tuning, and LLMs in particular have traditionally not performed well at medical coding. We optimize and evaluate a reproducible method for automated diagnosis assignment using LLMs in clinical notes and compare with EHR structured diagnoses. Methods: We used GPT-OSS for prompt engineering and task segmentation to create a model that extracts ICD-10-CM diagnoses, with estimates of severity, currency, and importance, from progress notes. We assessed performance across multiple cohorts of patients aged 0-21 years. For each, 100 outpatient provider notes were selected across levels of severity, along with coded diagnoses from that visit (EHR); a subset of 130 notes was subjected to clinical expert review. Results: Comparison showed 18.7% exact code and 33.3% ICD-10-CM category match between EHR and LLM, but semantic similarity of 0.93 at the category level. Compared to expert review, LLM precision was 0.84 and recall 0.49 for exact matches, and 0.92 and 0.62, respectively, for category-level matching. In contrast, EHR coded diagnoses showed slightly higher precision (0.94 for both cases) and substantially lower recall (0.27 and 0.43) versus expert review. Codes not identified by the LLM were more often rated by the reviewer as lower importance or certainty. Conclusion: We demonstrate a reusable approach to optimizing a pretrained LLM for use in diagnosis extraction from clinical notes, facilitating large-scale diagnosis screening by LLMs without the need for expensive study-specific model refinement.
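
The difference between exact-code and category-level matching reported in this abstract can be illustrated with a small sketch. It assumes set-based matching per note and takes the ICD-10-CM category to be the three characters before the decimal point. The helper names and the sample note are hypothetical, and the preprint's actual matching procedure may differ.

```python
def category(code):
    """ICD-10-CM category: the three characters before the decimal point."""
    return code.split(".")[0][:3].upper()

def precision_recall(predicted, reference):
    """Set-based precision and recall of predicted codes against a reference."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical codes for a single note (not data from the preprint).
llm = ["J45.909", "F41.1", "E66.9"]
expert = ["J45.40", "F41.1", "E66.9", "K21.9"]

exact = precision_recall(llm, expert)
by_category = precision_recall(map(category, llm), map(category, expert))
print(exact, by_category)
```

In this toy note the two asthma codes disagree at the exact level but share category J45, so both precision and recall rise when matching is done by category.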

5
PheBee: A Graph-Aware System for Scalable, Traceable, and Semantic Phenotyping

Gordon, D. M.; Homilius, M.; Antoniou, A. A.; Grannis, C.; Lammi, G. E.; Herman, A. C.; Kubatko, A.; Chaudhari, B. P.; White, P.

2026-05-13 health informatics 10.64898/2026.05.09.26352812 medRxiv
Top 0.1%
35.2%

Objectives: Phenotype-driven workflows in clinical and translational research require standardized ontology-based representation, ontology-aware cohort discovery, and provenance inspection for each assertion. Existing approaches optimize either for semantic traversal or scalable batch analytics, but not both. We describe PheBee, a hybrid system that links semantic assertions to scalable evidence storage via a deterministic identifier, preserving provenance while supporting ontology-aware discovery at cohort scale. Materials and Methods: PheBee represents phenotype assertions in a knowledge graph as ontology-linked nodes with clinical modifier context (e.g., negated, family history), and stores supporting evidence records in a scalable row-oriented evidence table for cohort-scale access. The two layers are connected by a deterministic identifier enabling stable joins across repeated ingestions without duplicating high-volume evidence in the graph. We evaluated PheBee using synthetic datasets designed to exercise end-to-end ingestion and query workflows. Results: Functional evaluation validated hierarchical term expansion, qualifier-aware retrieval, duplicate-free assertion handling under re-ingestion, and privacy-conscious management of subjects shared across multiple research projects. At scale (10,000 subjects producing 12M evidence records) PheBee completed ingestion in ~30 minutes and responded to interactive queries within 6 seconds under concurrent load. Discussion: PheBee exposes a unified API for ontology-aware cohort discovery with hierarchical term expansion, subject-centric retrieval of phenotypes and clinical modifiers, and evidence and provenance queries. Its data model aligns with GA4GH Phenopackets, facilitating interoperability with phenotype exchange standards.
Conclusion: By combining ontology-aware semantics with scalable, provenance-bearing evidence storage, PheBee provides a practical open-source foundation for phenotype-driven research workflows that demand both semantic precision and cohort-scale traceability. Lay Summary: Researchers often use "phenotypes" (observable clinical features) to describe individual subjects and find groups of similar subjects. Those phenotypes come from many sources and need both standard terminology and clear evidence for why a phenotype has been associated with a subject. PheBee is a software system that stores phenotype assertions in a way that supports both "ontology-aware" searching (for example, finding patients with any subtype of a condition) and scalable storage of supporting evidence across large research cohorts. PheBee uses multiple types of data storage so researchers can perform interactive phenotype searches and also store millions of pieces of supporting evidence. A shared identifier connects the two storage layers, so subjects' phenotypes and their supporting evidence remain linked even as new data is added over time. We evaluated PheBee using fully synthetic (non-patient) data to confirm correct query behavior, evidence traceability, and system performance at large scale.
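
One way a deterministic identifier can link a graph layer to an evidence table is sketched below. This is an illustration only: the abstract does not describe how PheBee derives its identifier, so the hashing scheme, the function names `assertion_id` and `ingest`, and the in-memory stores are all assumptions made for the example.

```python
import hashlib

def assertion_id(subject_id, term_id, qualifiers=()):
    """Hypothetical stable ID derived from the content of an assertion."""
    # Sort and de-duplicate qualifiers so their order does not change the ID.
    parts = [subject_id, term_id, *sorted(set(qualifiers))]
    key = "\x1f".join(parts)  # unit separator avoids ambiguous concatenation
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

graph = {}     # stand-in for the knowledge graph: one entry per assertion
evidence = []  # stand-in for the evidence table: one row per evidence record

def ingest(subject_id, term_id, qualifiers, source_note):
    aid = assertion_id(subject_id, term_id, qualifiers)
    # Re-ingesting the same assertion reuses the existing graph entry.
    graph.setdefault(aid, (subject_id, term_id, tuple(sorted(set(qualifiers)))))
    evidence.append((aid, source_note))
    return aid

first = ingest("subject-1", "HP:0001250", ["negated", "family_history"], "note-A")
second = ingest("subject-1", "HP:0001250", ["family_history", "negated"], "note-B")
print(first == second, len(graph), len(evidence))
```

Because the identifier depends only on the assertion's content, repeated ingestion adds evidence rows without creating a second graph entry, and the two layers can be joined on the shared ID.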

6
A Three-Tier Operational Benchmark for Evaluating Large Language Models on Hospital Medication Safety

Proulx, J.; Daines, B.; Barton, M.; Leonard, M. E.; Garcia, J. A.; Young, B.; Snell, Q.; West, T. W.; Watson, S. R.; AlQaseer, M.; Louiset, M.; Maqsood, M. B.; Voutt-Goos, M. J.; Douma, C.; Kasbekar, N.; Jeffries, J.; Abu-Rahmeh, W.; Frush, K.; Grewal, D. K.; Bahsoun, M.; Leonard, M.; Frankel, A.; Classen, D. C.; Pestotnik, S. L.

2026-06-10 health informatics 10.64898/2026.06.05.26354271 medRxiv
Top 0.1%
32.9%

Objective. To introduce PsiBench, a clinically validated medication-safety benchmark for evaluating large language models (LLMs) against the standards used to certify hospital computerized provider order entry (CPOE) and electronic health record (EHR) systems, and a non-overlapping three-tier evaluation framework separating highest-stakes discrimination, the operational CDS regime, and category-correct alerting. Materials and Methods. PsiBench comprises 492 medication-safety scenarios across 11 safety categories, created by clinical pharmacology experts whose work underpins an annualized testing procedure used by more than 2,000 U.S. hospitals. The three-tier framework partitions the scenarios non-overlappingly: Discrimination (98 scenarios, 50 fatal vs 48 deception, near-balanced 51%/49%); Operational (394 scenarios, 261 serious unsafe plus 133 safe including 41 Excessive Alerts reclassified as operational negatives); and Attribution (311 alert-required scenarios). We evaluated 40 frontier LLMs from 10 providers over 3 runs per scenario at temperature 0.2 (or the provider default where temperature is not configurable), yielding 59,040 evaluations conducted April 21-23, 2026. Results. Headline binary performance on the full benchmark spans a wide range across the 40 models: F1 78.5%-92.3%, accuracy 65.4%-89.8%, sensitivity 81.4%-100.0%, specificity 6.1%-81.8%. Leading models by F1 (o4-mini 92.3%; o3 92.2%) pair high sensitivity with meaningful specificity; three models saturate sensitivity at 100% but fall below 25% specificity, indistinguishable from a naive always-alert classifier. The wide spread on a single headline metric motivates tier-specific analyses, developed in a separate clinical paper. Discussion and Conclusion. PsiBench and the three-tier framework operationalize a rigorous evaluation rubric for LLM medication safety, grounded in two decades of national hospital audit experience. 
The framework generalizes to any binary medication-safety classifier (rule-based, conventional ML, or LLM-driven), supporting tier-aware model selection and post-deployment surveillance.
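
The counts quoted in this abstract can be checked with simple arithmetic, and the remark about a naive always-alert classifier can be illustrated with standard binary metrics. The sketch below is a reader's aid, not the benchmark's code: `binary_metrics` is a hypothetical helper, and the always-alert confusion counts are invented for illustration.

```python
# Counts quoted in the abstract.
SCENARIOS = 492
MODELS = 40
RUNS_PER_SCENARIO = 3

discrimination = 50 + 48   # fatal + deception scenarios
operational = 261 + 133    # serious unsafe + safe scenarios
total_evaluations = SCENARIOS * MODELS * RUNS_PER_SCENARIO

def binary_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "f1": f1}

# Hypothetical always-alert classifier on 300 unsafe and 100 safe cases.
always_alert = binary_metrics(tp=300, fp=100, tn=0, fn=0)

print(discrimination + operational == SCENARIOS)
print(total_evaluations)
print(always_alert)
```

The Discrimination and Operational tiers sum to the 492 scenarios, and 492 scenarios across 40 models and 3 runs give the 59,040 evaluations reported. In the invented example, alerting on everything yields perfect sensitivity and zero specificity while F1 stays high, which is why a single headline metric can mislead.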

7
Machine Learning Estimation of Gestational Age at Delivery Using Linked Mother-Infant Electronic Health Records Across Two Health Systems

Bejan, C. A.; Yang, X.; Pham, A.; Qassem, L.; Abraham, A. A.; Choi, L.; Rosenbloom, S. T.; Gamire, L. X.; Phillips, E. J.

2026-05-25 obstetrics and gynecology 10.64898/2026.05.23.26353959 medRxiv
Top 0.1%
32.9%

Objective: This study aimed to train and evaluate supervised machine learning algorithms using electronic health record (EHR) data to accurately estimate gestational age at delivery. Materials and Methods: We trained random forest, gradient boosting, and ensemble models on EHR data of mother-infant dyads from Vanderbilt University Medical Center (VUMC) and replicated the analyses at University of Michigan (UMich). We further analyzed EHR predictors of gestational age, assessed temporal drift in EHR data elements, and evaluated model performance stratified by delivery status. Results: The study included pregnancies corresponding to 54,344 and 34,345 mother-infant dyads at VUMC (2005-2025) and UMich (2012-2024), respectively. The gestational age predictions of the ensemble models achieved the highest agreement with the reference standard on the VUMC dataset (±1 week: 85.2%, ±2 weeks: 94.3%, MAE: 4.4 days) and demonstrated stronger generalization on the UMich dataset (±1 week: 93.1%, ±2 weeks: 97.8%, MAE: 2.8 days). Further, performance was better among pregnancies delivered in more recent years, and among full- and late-term deliveries compared with preterm deliveries. Discussion: The results indicate that supervised machine learning methods leveraging linked mother-infant EHRs can accurately estimate gestational age at delivery, while demonstrating the generalizability of the modeling approach and the portability of the analytic workflow across healthcare sites. Conclusion: This study presents a robust and generalizable machine learning framework to estimate gestational age at delivery. The framework can be reliably used to impute gestational age in large-scale, real-world clinical studies to support maternal and neonatal health research, in which accurate estimation of pregnancy onset is critical.

8
Operationalizing Eight-Dimensional Patient-Safety Risk Scoring at Scale: A Multi-Model Large Language Model Reliability Study

Lin, H.-M.; Lyu, J.; Wang, I.-L.

2026-06-01 health informatics 10.64898/2026.05.29.26354437 medRxiv
Top 0.1%
27.8%

Background: Hospital incident risk scoring has long relied on two- or three-dimensional frameworks (Severity Assessment Codes or Risk Priority Numbers), even though root cause analysis standards recognize that clinical risk is multi-factorial. The obstacle has been mainly cognitive: human reviewers cannot reliably score many dimensions across high incident volumes, so richer assessment has not been operationalized at scale. Objective: To extend the traditional three-dimensional FMEA to an eight-dimensional patient-safety risk feature framework, to establish a multi-model large language model (LLM) extraction pipeline that scores these dimensions automatically, and to demonstrate a variance-aware integer optimization (mean-variance integer programming, MV-IP) that provides a reproducible tie-breaking rule for incident prioritization under extraction uncertainty, rather than improved risk coverage. Methods: An 8-dimensional framework covering harm severity, potential harm, frequency, detectability, systemic impact, vulnerable populations, regulatory relevance, and economic impact was applied to 213 synthetic and 196 real curated incident narratives. Three independent LLMs (GPT-5.4, Gemini 3.1 Pro, Grok-4.1 Fast) from different provider families extracted structured risk scores. Inter-model consistency was assessed via ICC(A,1). Among coverage-equivalent selections, MV-IP minimized inter-model variance to give a reproducible prioritization rule. An English-language sensitivity analysis was conducted on 31 AHRQ PSNet WebM&M cases. Results: On real cases, seven of eight dimensions reached Fair or better inter-model reliability (ICC(A,1) 0.53 to 0.83); D5 (Systemic Impact) was the exception at Poor reliability (0.275), driven by little between-case variation rather than by wide model disagreement.
Reliability was not uniform: two dimensions were Excellent (D1 actual harm 0.834, D8 economic impact 0.782), two Good, and three only Fair, so some dimensions are more readily extractable than others. The same anchors gave broadly similar results on English-language narratives. When deterministic top-K selection returned several equal-coverage solutions (11 on real cases, total inter-model variance 0.205 to 1.274), MV-IP selected the minimum-disagreement set, replacing ad hoc tie-breaking with an explicit rule without improving coverage. Bootstrap resampling found 74% to 90% of per-case variance estimates stable despite the three-model panel. Conclusions: The eight-dimensional framework operationalizes patient-safety risk features that quality teams have considered only implicitly, and three independent LLM families produced reproducible scores on most dimensions of curated narratives. Inter-model agreement, however, measures reproducibility rather than clinical correctness, and high agreement does not by itself establish that a score is right; the dimensions that are reliably extractable today (notably D6 and D8) differ from those that are not yet (D5, and to a lesser degree D4 and D7), which has direct implications for incident-reporting form design. MV-IP contributes a reproducible, variance-aware tie-breaking rule rather than improved coverage. Validation against expert-prioritized RCA lists and deployment on raw institutional incident reports remain the next steps toward clinical use.

9
Registry Forge: an open-source end-to-end pipeline for patient-directed SMART on FHIR registries

Boyce, D.; Premasiri, A.; Sullivan, S.; Levine, B.; Vieira, F. G.

2026-06-03 health informatics 10.64898/2026.06.02.26354637 medRxiv
Top 0.1%
26.5%

Objectives: Patient-directed SMART on FHIR lets registries acquire longitudinal electronic health record data, but the payload requires substantial engineering before use. We present Registry Forge, an open-source pipeline that converts it into research-ready outputs. Materials and Methods: Registry Forge decodes and parses mixed C-CDA, HTML, RTF, PDF, and FHIR inputs, joins records to a canonical patient identifier, and emits a browser-viewable dashboard, an OMOP CDM v5.4 data set, GA4GH Phenopackets v2, a code inventory, and regex extractions of disease-specific narrative content. Results: Applied to the ALS Research Collaborative Study (94 participants, 56 US health systems), it processed 22,686 source files and 1,791 FHIR Bundles (109,599 resources); only 15.0% of files were full C-CDA. Discussion: This pipeline generalizes to any registry acquiring data through patient-directed SMART on FHIR. Conclusion: Registry Forge closes the acquisition-to-analysis gap with no server infrastructure and is openly available.

10
Extraction of Human Phenotype Ontology (HPO) Concepts from Clinical Notes Utilizing Large Language Models (LLM) with Model Context Protocol (MCP)

Larsen, M. E.; Campbell, I. M.; Orlando, L. A.; Robinson, P.; Walton, N. A.

2026-05-25 health informatics 10.64898/2026.05.23.26353963 medRxiv
Top 0.1%
23.6%

Background: Accurate extraction of Human Phenotype Ontology (HPO) terms from clinical notes is essential for variant prioritization and genetic diagnosis. Large language models (LLMs) often struggle to balance precision, hallucination avoidance, and ontology mapping accuracy, and prior work has shown that retrieval-based grounding can improve performance for individual models. We hypothesized that real-time ontology grounding through external tools would improve these metrics across heterogeneous LLMs, and we evaluated the Model Context Protocol (MCP), a standardized open framework for integrating external tools, as a vendor-agnostic mechanism for delivering such grounding. Methods: Five LLMs (Claude Sonnet 4.5, GPT-5.1, Gemini 2.5 Pro, Grok 4.1, and Qwen3 30B) extracted HPO terms from four synthetic clinical genetics notes under two conditions: baseline ("No Tools," internal knowledge only) and tool-augmented ("With Tools"), with real-time HPO retrieval delivered through MCP for models with native support and through functionally equivalent native tool-calling interfaces otherwise. Each model performed ≥50 runs per note per condition (>2,000 total runs). Performance was evaluated using Precision, Recall, and F1-score. Outputs were manually adjudicated to classify mapping errors and hallucinations. Results were benchmarked against a commercial EHR-based HPO extraction tool. Results: Tool augmentation significantly improved performance across all models. Mean aggregate F1-score increased from 0.46 (SD 0.22) in the baseline condition to 0.72 (SD 0.15) with tools (p < 0.001). Mapping Error Rate decreased from 40.9% to 7.8% (p < 0.001), and Precision increased from 56% to 90%. Performance gains were observed across all model families, including the open-weight Qwen3 model (F1 0.11→0.50). For inferred phenotypes, F1 improved from 0.20 to 0.34 (p < 0.001) without a significant increase in hallucination rate (p = 0.08).
Compared with the commercial benchmark, tool-augmented LLMs achieved higher F1-scores and substantially greater recall for inferred phenotypes. Conclusions: Real-time ontology grounding substantially improves HPO extraction across diverse LLMs by reducing mapping errors and enhancing phenotype inference. The Model Context Protocol provides a standardized, interoperable mechanism for delivering such grounding, supporting reproducible, vendor-agnostic deployment of clinical LLM pipelines in genomic medicine.

11
A Heterogeneous Graph Neural Network Framework for Multi-Horizon Stroke Mortality Prediction

Tharzeen, A.; Vafaei Sadr, A.; Radfar, N.; Hwang, W.; Abedi, V.; Zand, R.

2026-06-10 health informatics 10.64898/2026.06.09.26355176 medRxiv
Top 0.1%
23.4%

Background: Machine learning models for stroke mortality prediction typically treat each time horizon independently and use flat tabular features that ignore the relational structure of electronic health records (EHRs). In this pilot study, we leveraged graph-based machine learning models to predict post-stroke all-cause mortality across three different time horizons. Methods: We developed Stroke Temporal Heterogeneous Graph (StrokeTHG), a heterogeneous graph neural network model for simultaneous multi-horizon stroke mortality prediction (30-day, 90-day, 1-year) using EHR data from Penn State Health System. The model encodes various relations among EHR entities (e.g., patient, diagnosis, comorbidity) and temporal encoding of admission time to better predict stroke mortality. We compared our proposed approach against various baseline methods, including Logistic Regression, Random Forest, and XGBoost. We also performed ablation and subgroup analyses, evaluated the quality of learned graph embeddings, and assessed the importance of different edge types in the graph. Results: We included 4,144 stroke patients (mean age 69.2 years; 54.3% men), of whom 3,332 (80.4%) survived their stroke after one year. 30-day, 90-day, and 1-year mortality rates were 9.7%, 13.7%, and 19.6%, respectively. Our proposed approach, StrokeTHG, achieved AUROC of 0.872, 0.878, and 0.837 across horizons, outperforming all tabular baselines. At ≥75% specificity, the model identified 5-10 percentage points more mortality cases than the best baseline at each horizon. Subgroup analysis demonstrated consistent performance across sex subgroups and the largest discriminative gains in the Age 65-80 stratum. Edge-type ablation identified phenotype-patient and admission-patient edges in the constructed EHR graph as the most influential relational edges for mortality prediction.
StrokeTHG embeddings outperformed all graph and matrix factorization baselines under an identical downstream classifier, confirming that performance gains stem from representation quality rather than classifier capacity. Conclusions: StrokeTHG demonstrates that heterogeneous graph representations of EHR data provide a consistent improvement over flat tabular models for multi-horizon stroke mortality prediction, with particular advantage at clinically actionable sensitivity thresholds and novel multi-horizon monotonic prediction capability. This methodological framework may be adaptable to other EHR-based clinical research studies seeking to leverage heterogeneous relational structures for predictive modeling.

12
Relationship Extraction for Adverse Drug Events in Clinical Notes Using Large Language Models

Plasek, J. M.; Li, Y.; Amato, M. G.; Foer, D.; Seger, D. L.; Alzaidi, S.; Zhou, H.; Jackson, G. P.; Bates, D. W.; Zhou, L.

2026-06-01 health informatics 10.64898/2026.05.28.26354362 medRxiv
Top 0.1%
23.2%

Background: Adverse drug events (ADEs) are a critical indicator of patient safety but are often documented only in free-text clinical notes. The potential of recent advances in natural language processing (NLP), particularly generative large language models (LLMs), to identify ADEs remains understudied. This study aimed to compare the performance of multiple LLMs in identifying ADE-Drug relationships in inpatient and ambulatory clinical notes. Methods: We used clinical notes from the 2018 National NLP Clinical Challenge (n2c2) ADE dataset (inpatient; n=505) and from outpatient encounters (n=2,555) between October 1, 2018, and December 31, 2019, at a large academic medical center based in New England. Notes were pre-processed into snippets for model input. Evaluated models included GPT-4o, GPT-4o-mini, and LLAMA 3.3-70B, along with their instruction fine-tuned variants (including low-rank adapters for LLAMA). Performance was assessed using both strict and relaxed evaluations (precision, recall, and F1) for all models, followed by manual evaluation (exact semantic match, partial match, missing ADE, drug mention only, not a drug, or wrong) of the two best-performing models. Results: GPT-4o and GPT-4o-mini were the top-performing models among those evaluated. GPT-4o consistently outperformed GPT-4o-mini in ADE extraction across both datasets, with higher F1-scores (0.524 vs. 0.381) and a more balanced precision-recall profile. Both models captured ADEs effectively in explicit and complex clinical contexts, although limitations included misclassification of pre-existing allergies and occasional conflation of therapeutic indications with adverse effects. GPT-4o achieved higher exact match coverage and fewer errors across clinical notes, indicating more reliable performance in both inpatient and ambulatory settings.
Conclusion: This work establishes a foundation for integrating LLM methods into real-world drug safety surveillance, with direct implications for improving patient safety.

13
Genosolver: Rare Disease Diagnosis through Holistic Integration of Unstructured Clinical Narratives Using Large Language and Reasoning Models

Islam, T.; Danner, M.; Ziad, Z.; Begemann, M.; Beijer, D.; Lischka, A.; Lausberg, E.; Mattern, L.; Suh, J.; Wittig, P.; Guezel, N.; Schlaich, E.; Karaivanova, R.; D'Augello, S.; Franken, L.; Ruedebusch, J.; Mueller, R.; Perchalla, E.; Zempel, H.; Haag, N.; Eggermann, K.; Eggermann, T.; Meyer, R.; Kraft, F.; Elbracht, M.; Kurth, I.; Krause, J.

2026-06-05 health informatics 10.64898/2026.06.04.26354845 medRxiv
Top 0.1%
22.7%

Background: Molecular medicine has made genetic diagnostics crucial for rare diseases, but the majority of patients remains without diagnosis even after state-of-the-art assessment. Standardized systems for integrating clinical features, such as the Human Phenotype Ontology (HPO), offer assistance, but are often insufficiently detailed and fail to capture crucial clinical parameters such as age at onset, longitudinal changes in symptoms, detailed characteristics of a clinical symptom, or the absence of a feature. Results: We present Genosolver, an integrated workflow that utilizes machine learning to address this bottleneck. Using Large Language Models (LLMs) and Large Reasoning Models (LRMs) on unstructured clinical notes and electronic health care data, we generate a workflow that unifies phenotype extraction, generates differential diagnosis, and prioritizes genetic variants from genome data. We evaluated the performance on 233 previously genetically solved cases, where Genosolver ranked the causative gene first in 72% of cases and in 94% of cases in the top 10 gene list, outperforming the existing benchmarking tool Exomiser by 9%. Semi-automated reanalysis of 1,875 unsolved rare disease cases yielded an additional diagnostic rate of 1.7%. Incorporating rich, unstandardized clinical narratives substantially enhanced model performance beyond HPO-only inputs and demonstrated competitive results using data security compliant local models. Conclusion: Integrating unstandardized clinical data with local LLMs and reasoning offers a scalable, data-secure workflow that increases molecular diagnoses in rare diseases.

14
Mitigating Automation Bias in Physician-LLM Diagnostic Reasoning Using Behavioral Nudges: A Randomized Controlled Trial

Qazi, I. A.; Ali, A.; Khawaja, A. U.; Akhtar, M. J.; Sheikh, A. Z.; Alizai, M. H.

2026-06-02 health informatics 10.64898/2026.06.01.26354596 medRxiv
Top 0.1%
22.4%

As large language models (LLMs) enter clinical workflows, automation bias, the uncritical acceptance of automated output, poses a patient-safety risk. Optimal physician-AI collaboration requires trust calibration, matching scrutiny to LLM recommendation accuracy. We report a randomized trial evaluating a behavioral nudge to mitigate automation bias. Seventy-two AI-trained physicians were randomized to evaluate six vignettes alongside ChatGPT-5.1 recommendations, consulted at each physician's discretion; three contained deliberate, clinically significant errors. The treatment arm received a dual-component nudge: an anchoring cue reporting ChatGPT's benchmark accuracy to calibrate expectations, and a case-specific, selective-attention cue (a numeric accuracy rating and color-coded traffic light, derived from the mean of three distinct-family LLMs). The control group saw recommendations alone; blinded reviewers scored diagnostic reasoning against an expert rubric. The treatment group scored significantly higher (mean difference, 7.6 percentage points; 95% CI, 1.4-13.9; P=0.016) than the control, suggesting a scalable strategy to preserve clinical judgment in LLM-assisted care. ClinicalTrials.gov registration: NCT07328815.

15
From Charting Burden to Workflow Signal: Retrospective Validation of Documentation-Density Measures for ICU Complexity and Long-Stay Risk

Collier, A.

2026-06-06 health informatics 10.64898/2026.06.04.26354922 medRxiv
Top 0.1%
22.3%

Background: Electronic health record documentation patterns may reflect workflow complexity, monitoring intensity, and operational strain in intensive care settings. However, documentation-derived features can be sensitive to local documentation culture, data capture systems, and outcome definitions. Retrospective validation across multiple datasets is therefore needed before these signals are used in workflow intelligence or clinical AI governance tools. Objective: To evaluate whether documentation-density and documentation-timing features show reproducible retrospective signal for ICU workflow complexity and long-stay proxy outcomes across de-identified critical care datasets, while distinguishing workflow and long-stay associations from unsupported claims about mortality prediction, burden reduction, or deployment readiness. Methods: We synthesized retrospective validation results from de-identified ICU and workflow datasets generated through a prespecified documentation-density validation program. Feature families included Documentation Burden Score style features, Shift-End Documentation Rate style features, documentation reliability style metadata, and all-documentation feature sets where available. Outcomes included long ICU length of stay proxies, mortality where available, and workflow proxy endpoints. Models compared baseline feature sets with enhanced models containing documentation-density or workflow features. Performance was summarized using area under the receiver operating characteristic curve (AUROC), Brier score where reported, delta AUROC, bootstrap confidence intervals where reported, and label-shuffle controls where available. Results: The strongest external long-stay proxy evidence came from the NWICU chartevents analysis, which included 28,612 ICU stays, 20,267 stays with chart events, and 9,619,759 chart events. For ICU length of stay greater than the median, baseline AUROC was 0.5252. Enhanced AUROC was 0.9512 for Documentation Burden Score features, 0.9214 for Shift-End Documentation Rate features, 0.8470 for documentation reliability style features, and 0.9517 for all documentation features. Corresponding label-shuffle enhanced AUROCs were near random, ranging from 0.4897 to 0.5064. For ICU length of stay greater than the 75th percentile, baseline AUROC was 0.5155. Enhanced AUROC was 0.9433 for Documentation Burden Score features, 0.9194 for Shift-End Documentation Rate features, 0.8118 for documentation reliability style features, and 0.9427 for all documentation features, with label-shuffle enhanced AUROCs from 0.4836 to 0.4999. Additional retrospective support was observed in eICU workflow analyses, HiRID first-24-hour documentation-density analyses, MIMIC-IV HF ICU internal analyses, MIMIC-IV-Note metadata extensions, and nursing-chart or lab density proxy analyses. However, cross-institution discrimination transfer was weak without recalibration, and several analyses remained proxy validations rather than final clinical validations. Conclusions: Documentation-density and documentation-timing features show promising retrospective signal for ICU workflow complexity and long-stay proxy outcomes, especially in NWICU chartevents and selected internal dataset-specific analyses. These findings support further preregistered, prospective, silent-mode validation of documentation-derived workflow intelligence. They do not establish prospective clinical performance, mortality reduction, clinician burden reduction, autonomous deterioration prediction, or deployment readiness.

16
An extensible laboratory information management system for data harmonization across research centers: The ICTS-Dashboard

King, C. H.; De Dios, I.; Barrick, R.; Berger, S.; Almalvez, M.; Auriga, L.; Delot, E. C.; Xiao, C.; LoTempio, J.; Vilain, E.

2026-06-02 health informatics 10.64898/2026.05.31.26354439 medRxiv
Top 0.1%
20.0%

Background: Collaborative research programs increasingly require infrastructure capable of integrating heterogeneous participant, sample, and experimental data while meeting evolving research needs. Existing tools, including clinical EHRs, REDCap, generic research information management systems, and bespoke database builds, were not designed to operationalize project-specific data models. The ICTS-Dashboard, from the Institute for Clinical and Translational Science (ICTS) at the University of California, Irvine (UCI), fills this need by providing a general-purpose research information management system. Methods: We describe the ICTS-Dashboard, built as an open-source, schema-driven platform in which database structure, server-side validation, representational state transfer application programming interfaces (REST APIs), web-based forms, and reproducible exports are all generated from a single versioned JavaScript Object Notation (JSON) Schema set. The backend is implemented in Django, Django REST Framework, and PostgreSQL; the frontend in React. We instantiate the platform with the Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Data Model and extend it with two case studies: a locally developed biobank table for biospecimen logistics, and an embedded adaptation of the RAG-HPO retrieval-augmented phenotype curation tool. Results: The ICTS-Dashboard deployed at the UCI-GREGoR site supports 37 schema-derived tables and 250 documented API endpoints. It holds metadata for 2,558 participants, 1,237 families, 5,517 biobank entries, 2,466 sequenced biospecimens, and 289 genetic findings, and supports quarterly external data submissions regenerated directly from the database. The biobank extension adds entities the consortium does not standardize while preserving foreign-key linkage to rare disease records; the RAG-HPO module adds curator-mediated phenotype normalization against 19,389 indexed HPO terms. Both were integrated without modifying the GREGoR data model. Conclusion: A version-controlled, machine-readable data model can serve not only as a data sharing standard but as the operational backbone of a research program when paired with schema-governed tooling. The Dashboard's architecture is not specific to any one data model or to rare disease; any collaborative research program with a structured, versioned model can adopt the same pattern to reduce implementation overhead and improve reproducibility, harmonization, and findable, accessible, interoperable, and reusable (FAIR)-aligned accessibility.

17
Automating Screening of Titles and Abstracts in Systematic Reviews: An Assessment of GPT-4o mini

Fazeli, M. S.; Kasireddy, E.; Pourrahmat, M.-M.; Chow, C.; Collet, J. P.

2026-05-20 health informatics 10.64898/2026.05.15.26353334 medRxiv
Top 0.1%
18.8%

Background: Systematic literature reviews (SLRs) are essential in medical research, but are often time-consuming and costly, necessitating more efficient methods that maintain accuracy. Objective: This study assessed the performance of the GPT-4o mini large language model (LLM) in automating the first phase of study selection based on titles and abstracts in systematic reviews. Specifically, we evaluated whether the model improved efficiency without compromising quality. Methods: Structured prompts were created for GPT-4o mini to facilitate title and abstract screening. The model's performance was evaluated against expert human reviewers across five systematic reviews on inclusion rates, sensitivity, specificity, accuracy, positive predictive value, and negative predictive value. Results: The model screened a total of 15,605 records. It included a higher percentage of studies than human screeners, with 3.5% (n=549/15,605) true positives and 14.2% (n=2,218/15,605) false positives. The model achieved an overall accuracy of 85.1%, with a sensitivity of 83.2% and specificity of 85.2%. The positive predictive value was 19.8%, while the negative predictive value was 99.1%. The model was able to screen 1,000 titles and abstracts in 40 minutes, compared to 16 hours required by a human reviewer. Conclusion: This study demonstrated strong performance and efficiency in the automation of title and abstract screening in SLRs using an advanced LLM. Further refinements could optimize the balance between sensitivity and specificity, supporting broader implementation in evidence synthesis. A hybrid AI-human approach is recommended to ensure accuracy, reduce reviewer burden, and maintain the methodological rigor required for high-quality SLRs.
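The screening metrics reported above all follow from a single 2x2 confusion matrix, which the following minimal Python sketch makes explicit. The abstract gives the true-positive and false-positive counts directly; the false-negative count of 111 is not reported and is an assumption here, back-calculated from the reported 83.2% sensitivity, with true negatives taken as the remainder. The function name `screening_metrics` is purely illustrative.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard diagnostic-accuracy metrics from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),  # share of relevant records the model included
        "specificity": tn / (tn + fp),  # share of irrelevant records the model excluded
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),          # share of model inclusions that were relevant
        "npv": tn / (tn + fn),          # share of model exclusions that were irrelevant
    }


TOTAL = 15_605  # records screened (reported)
TP = 549        # true positives (reported)
FP = 2_218      # false positives (reported)
FN = 111        # ASSUMPTION: not reported; inferred from 83.2% sensitivity
TN = TOTAL - TP - FP - FN  # remainder under the assumption above

metrics = screening_metrics(TP, FP, FN, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under that assumed false-negative count, the sketch reproduces the reported sensitivity, specificity, accuracy, and predictive values to one decimal place. It also shows why the positive predictive value is low despite good specificity: relevant records are rare, so false positives outnumber true positives roughly four to one.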

18
A Multi-Agent RAG Framework for Biomedical Literature Analysis

Palem, R. R.; Chen, H.; Yue, Z.

2026-05-29 bioinformatics 10.64898/2026.05.26.727050 medRxiv
Top 0.1%
18.5%

Background: The biomedical literature is expanding at an unprecedented rate, with over 4,000 new articles indexed on PubMed each day. Clinicians and researchers frequently lack the time to review this volume before making decisions. Retrieval-Augmented Generation (RAG) systems attempt to bridge this gap by grounding language model responses in relevant documents, but standard implementations rank all retrieved passages solely by semantic similarity, treating a case report and a meta-analysis as equally authoritative. Objective: We aimed to develop and pilot-evaluate a RAG variant that incorporates evidence quality and publication recency into the retrieval scoring function, and to determine whether these signals improve answer quality on biomedical questions compared with standard cosine similarity RAG and a full-context baseline. Methods: We developed ET-RAG (Evidence-Temporal RAG), which scores each retrieved chunk using a weighted combination of cosine similarity (50%), evidence quality based on the GRADE hierarchy (30%), and temporal recency (20%). We evaluated ET-RAG alongside two baselines: a full-context agent powered by Gemini 2.0 Flash and a standard cosine RAG agent using GPT-4o-mini. All agents were tested on 40 benchmark questions (10 single-choice, 10 multiple-choice, 10 short-answer, and 10 long-answer) drawn from 10 peer-reviewed Alzheimer's disease papers published between 2021 and 2025. Results: ET-RAG achieved the highest scores across all four question categories: single choice (0.90), multiple choice (0.74), short answer (0.92), and long answer (0.89), with a combined average of 0.86. Cosine RAG scored 0.80, 0.48, 0.82, and 0.69, respectively (average 0.70), while the full-context agent scored 0.60, 0.59, 0.71, and 0.53 (average 0.61). The full-context agent, despite having access to the entire corpus through Gemini's large context window, struggled with consistent answer extraction and was prone to rate limiting under heavy query loads. A control question on forestry was correctly rejected by all three agents, suggesting no hallucination on this control item. Conclusions: In this pilot Alzheimer's disease benchmark, incorporating evidence quality and recency into RAG retrieval improved answer quality relative to pure cosine similarity retrieval and full-corpus prompting. The evidence-temporal scoring function is lightweight to implement and adds minimal computational overhead to existing vector search pipelines, but broader validation across domains, evidence levels, and stronger retrieval baselines is required before claims of generalizable biomedical reliability can be made.
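The weighted retrieval score described in the Methods can be illustrated with a short Python sketch. Only the three weights (50% cosine similarity, 30% evidence quality, 20% recency) come from the abstract. The `Chunk` class, the `et_score` and `rank_chunks` names, the assumption that each component is already normalized to the range 0 to 1, and the example values are all hypothetical, so this is an illustration of the idea rather than the authors' implementation.

```python
from dataclasses import dataclass

# Weights as stated in the abstract.
W_SIMILARITY, W_EVIDENCE, W_RECENCY = 0.5, 0.3, 0.2


@dataclass
class Chunk:
    """A retrieved passage with three component scores (hypothetical structure)."""
    label: str
    similarity: float  # cosine similarity to the query, assumed in [0, 1]
    evidence: float    # evidence-quality score, assumed pre-mapped to [0, 1]
    recency: float     # publication-recency score, assumed pre-mapped to [0, 1]


def et_score(chunk: Chunk) -> float:
    """Weighted combination of similarity, evidence quality, and recency."""
    return (W_SIMILARITY * chunk.similarity
            + W_EVIDENCE * chunk.evidence
            + W_RECENCY * chunk.recency)


def rank_chunks(chunks: list[Chunk]) -> list[Chunk]:
    """Order chunks from highest to lowest combined score."""
    return sorted(chunks, key=et_score, reverse=True)


# Illustrative values only: a very similar case report vs. a slightly less
# similar but higher-quality, more recent meta-analysis.
case_report = Chunk("case report", similarity=0.90, evidence=0.2, recency=0.5)
meta_analysis = Chunk("meta-analysis", similarity=0.80, evidence=1.0, recency=0.8)

for chunk in rank_chunks([case_report, meta_analysis]):
    print(f"{chunk.label}: {et_score(chunk):.2f}")
```

With these made-up inputs the meta-analysis outranks the case report even though its cosine similarity is lower, which is the behavior the abstract contrasts with similarity-only ranking.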

19
Decomposing growth in a national HL7 CDA clinical document repository

Talvik, H.-A.; Laur, S.; Vilo, J.; Reisberg, S.

2026-05-26 health informatics 10.64898/2026.05.24.26353991 medRxiv
Top 0.1%
18.5%

Longitudinal evaluations of national electronic health record repositories often track document counts alone, obscuring changes in content size, structure and standards implementation. We decomposed growth in the Estonian Health Information System across document counts, per-document size, section-level structure and version uptake in a 10% random population sample of 4.97 million HL7 Clinical Document Architecture Release 2 documents from 147,819 patients, spanning 2012-2019 and four prespecified document types. Growth patterns differed by document type. Inpatient summaries increased 48.5% in total content volume despite a 2.4% decline in document counts. Section presence and within-section content were highly skewed; 44.6% of 892 data locations carried one fixed value. Code-system diversity increased from 45 to 79, and version uptake took years: inpatient summaries reached 80% organisational uptake after a median 44 months (95% CI 11-78). This decomposition can guide extraction pipelines, secondary use and standards governance in CDA- and FHIR-based repositories.
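The inpatient-summary figures above show why counts alone can mislead. As a rough sketch, assume that total content volume equals document count multiplied by mean per-document size, and that both reported changes cover the same period; neither assumption is stated in the abstract. Under those assumptions, the implied change in mean document size follows from dividing the two growth factors. The function name `implied_size_change` is hypothetical.

```python
def implied_size_change(volume_change: float, count_change: float) -> float:
    """Relative change in mean per-document size implied by the changes in
    total volume and document count, assuming volume = count * mean size."""
    return (1 + volume_change) / (1 + count_change) - 1


# Reported for inpatient summaries: +48.5% total volume, -2.4% document count.
size_change = implied_size_change(volume_change=0.485, count_change=-0.024)
print(f"Implied change in mean document size: {size_change:.1%}")
```

Under these assumptions the mean inpatient summary grew by roughly half, so all of the volume growth came from documents getting larger rather than more numerous.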

20
Quality and Safety profiles of AI-Generated vs Clinician-Generated Handoffs in Hospital Medicine

Shah, K. P.; Airan Javia, S.; Savage, T.; Bressman, E.

2026-06-08 health informatics 10.64898/2026.06.05.26354946 medRxiv
Top 0.1%
18.2%

End-of-rotation handoffs are critical for patient safety but add to documentation burden for hospitalists. Generative artificial intelligence (AI) may help automate handoff creation using electronic health record data, but its impact on quality and safety is unclear. Methods: We developed an AI handoff tool with a large language model using clinical notes as input and conducted a retrospective evaluation comparing AI-generated and clinician-authored handoffs. Handoffs were assessed across domains of quality and safety through a structured review. Results: Quality ratings were similar between AI and human handoffs (3.7 vs. 3.5, p=0.57). AI-generated handoffs were rated higher for organization (4.4 vs. 4.1, p=0.05) and completeness (4.1 vs. 3.6, p=0.01), but lower for conciseness (3.7 vs. 4.1, p=0.03) and accuracy (4.1 vs. 4.4, p=0.03). Error rates were comparable (0.3/handoff in both groups); however, AI-generated handoffs included inaccuracies (9% of AI errors) and hallucinations (1% of AI errors), while clinician-authored handoffs contained only omissions. Conclusion: Human and AI handoffs have differing error profiles and tradeoffs between completeness and conciseness. Prospective evaluation in clinical workflows is underway.